White Wine: the quality of the taste by Gloria SANCHEZ

This report explores a dataset containing 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

The aim of this report is to find which chemical properties influence the quality of white wine.

Univariate Plots Section

## [1] 4898   13
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median : 5.200   Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

Our dataset contains 13 variables: 11 chemical properties, the wine rating (quality) and the wine code (X).

The summary table shows the distribution of each variable: the mean, the median, the minimum and maximum values, and the 1st and 3rd quartiles.

The dataset does not appear to have missing values: in the summary table, every variable except citric acid has a minimum above 0. Citric acid has a minimum of 0, so we check later whether one or several wines have this value, to decide whether it marks a missing value or a genuine 0.
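
A check of this kind can be sketched in base R. The data frame below reproduces only four columns of the first ten rows printed by `str()` above; the names `wine_head` and `count_na_and_zero` are hypothetical, and the full data set would be loaded from file instead.

```r
# First ten rows of four columns, copied from the str() output above.
# `wine_head` is an illustrative stand-in for the full 4,898-row data frame.
wine_head <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  citric.acid    = c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Count missing values (NA) and exact zeros in every column.
count_na_and_zero <- function(df) {
  data.frame(
    n_na   = colSums(is.na(df)),
    n_zero = colSums(df == 0, na.rm = TRUE)
  )
}

checks <- count_na_and_zero(wine_head)
print(checks)
```

On the full data set, the same call should report the zeros in `citric.acid` that appear in the count table for low citric acid values.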

## Wine per quality
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
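
As a small sketch, the counts above can be turned into shares with `table()` and `prop.table()`. The quality vector is rebuilt here from the printed counts, so the variable names are illustrative rather than taken from the original analysis.

```r
# Rebuild the quality column from the counts printed above.
quality_counts <- c(`3` = 20, `4` = 163, `5` = 1457, `6` = 2198,
                    `7` = 880, `8` = 175, `9` = 5)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Frequency table and percentage share of each rating.
freq  <- table(quality)
share <- prop.table(freq)
print(round(100 * share, 1))

# Combined share of the wines rated 5 or 6.
share_5_6 <- sum(share[c("5", "6")])
print(share_5_6)
```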

Most wines are rated 5 or 6. To analyze the distribution of the values, we plot them with different scales and bin widths and with a specific focus on some values.

Let's look at the acidity properties.

## Count for citric acid (g / dm^3) low values
##    0 0.01 0.02 0.03 0.04 0.05 
##   19    7    6    2   12    5

Fixed acidity of most wines lies between 6.3 and 7.3 and is approximately normally distributed. A few wines have very low fixed acidity (left tail) and a few have very high values (right tail).

Volatile acidity of most wines lies between 0.21 and 0.32 and is approximately normally distributed. Some wines have higher values, up to 1.1 (right tail).

Citric acid of most wines lies between 0.27 and 0.39, but about 50 wines have values of 0.05 or lower, 19 of them exactly 0. We treat these zeros as valid data rather than missing values.

Residual sugar is mainly between 0 and 10, but some values exceed 10 and the maximum is 65.8. The distribution is skewed to the right; on a log10 scale it shows two roughly normal peaks.
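
The effect of a log10 scale on a right-skewed variable can be sketched in base R. The values below are synthetic (a mixture of two log-normal groups chosen for illustration) and are not the wine data; the original report may well have used ggplot2 instead of `hist()`.

```r
# Synthetic, right-skewed values standing in for residual sugar.
# These are NOT the wine data; they only illustrate the transformation.
set.seed(42)
sugar <- c(rlnorm(600, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(400, meanlog = log(10),  sdlog = 0.4))

# Histogram bins on the raw scale and on the log10 scale.
raw_hist <- hist(sugar, breaks = 40, plot = FALSE)
log_hist <- hist(log10(sugar), breaks = 40, plot = FALSE)

# Draw both panels side by side, then restore the plotting parameters.
op <- par(mfrow = c(1, 2))
plot(raw_hist, main = "Raw scale", xlab = "residual sugar")
plot(log_hist, main = "log10 scale", xlab = "log10(residual sugar)")
par(op)
```

On the raw scale the long right tail compresses most of the bins; on the log10 scale the two groups become easier to tell apart.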

Chlorides, free sulfur dioxide and total sulfur dioxide are approximately normally distributed, with some values in the right tail.

Density ranges from 0.9871 to 1.0390, with an interquartile range of ~0.0044 (0.9917 to 0.9961). We restrict the axis limits to the interquartile range and adjust the scale, the bin width and the breaks to get a clear view of the density distribution. In detail, some density values are more frequent than others, even though the overall shape is approximately normal.
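
The interquartile range follows directly from the quartiles in the summary table, and the same quartiles can be used to keep only the central half of the values. This is a minimal sketch: `within_quartiles` is a hypothetical helper, and the five density values are the ones printed by `str()` above.

```r
# Quartiles of density as printed in the summary table above.
q1 <- 0.9917
q3 <- 0.9961
iqr_density <- q3 - q1   # interquartile range
print(iqr_density)

# Keep only the values lying between the first and third quartile.
# `x` is any numeric vector; the quartiles are recomputed from it.
within_quartiles <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  x[!is.na(x) & x >= q[1] & x <= q[2]]
}

# Five density values from the str() output, used as a small demonstration.
density_head <- c(1.001, 0.994, 0.995, 0.996, 0.996)
middle <- within_quartiles(density_head)
print(middle)
```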

pH and sulphates are approximately normally distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol is roughly normally distributed with a long right tail. 75% of the wines have at most 11.4% alcohol. The wine with the highest alcohol content has 14.20%.

The boxplots show the distribution of values for the different properties.

Some properties have outliers. They can be removed property by property or after a multivariate analysis.

Removing outliers changes the interpretation of some properties, but it is more informative when done as part of a multi-property analysis. It’s important to note that we may not always be interested in the bulk of the data. Sometimes, the outliers are of interest, and it’s important that we understand their values and why they appear in the data set.
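
One common way to flag outliers for a single property is the 1.5 × IQR rule used by boxplots. The report does not state which rule was applied, so the rule, the helper `is_outlier` and the demonstration values below are all assumptions made for illustration.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR]; k = 1.5 is the usual
# boxplot convention and an assumption here, not the report's stated rule.
is_outlier <- function(x, k = 1.5) {
  q <- unname(quantile(x, probs = c(0.25, 0.75), na.rm = TRUE))
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Hypothetical chloride-like values with one extreme observation.
chlorides_demo <- c(0.036, 0.043, 0.045, 0.049, 0.050, 0.058, 0.346)
flags <- is_outlier(chlorides_demo)

outliers <- chlorides_demo[flags]    # inspect these before discarding them
trimmed  <- chlorides_demo[!flags]   # values kept for the "no outlier" view
print(outliers)
```

Keeping the flagged values in a separate vector, rather than dropping them silently, makes it possible to study why they appear in the data.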

Univariate Analysis

The data set contains 4898 observations with 13 variables: 11 chemical properties, the wine rating (quality) and the wine code (X)

The chemical properties are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol.

At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

The summary table shows the distribution of each variable: the mean, the median, the minimum and maximum values, and the 1st and 3rd quartiles.

This dataset has no missing values: in the summary table, the minimum of every variable is higher than 0, except for citric acid. Since citric acid ranges from 0 to 1.66, a minimum of 0 looks like a valid value and not a mistake.

The main feature is the quality of the wine, which results from the chemical properties: it is the mix of those properties that creates a synergy in the taste of the wine.

Most wines are rated 5 or 6. To analyze the distribution of the values, we plot them with different scales and bin widths and with a specific focus on some values.

Fixed acidity of most wines lies between 6.3 and 7.3 and is approximately normally distributed. A few wines have very low fixed acidity (left tail) and a few have very high values (right tail).

Volatile acidity of most wines lies between 0.21 and 0.32 and is approximately normally distributed. Some wines have higher values, up to 1.1 (right tail).

Citric acid of most wines lies between 0.27 and 0.39, but more than 200 wines have values around 0.5, and some have even higher values (right tail).

Residual sugar is mainly between 0 and 10, but some values exceed 10 and the maximum is 65.8. We plot the values from 0 to 10 to see the residual sugar distribution in detail. The distribution is skewed to the right.

Chlorides, free sulfur dioxide and total sulfur dioxide are approximately normally distributed, with some values in the right tail.

Density ranges from 0.9871 to 1.0390, with an interquartile range of ~0.0044 (0.9917 to 0.9961). We restrict the axis limits to the interquartile range and adjust the scale, the bin width and the breaks to get a clear view of the density distribution. In detail, some density values are more frequent than others, even though the overall shape is approximately normal.

pH and sulphates are approximately normally distributed.

Alcohol is roughly normally distributed with a long right tail. 75% of the wines have at most 11.4% alcohol. The wine with the highest alcohol content has 14.20%.

The most interesting chemical features are the ones with a larger range and more varied values, e.g. residual sugar and alcohol.

Residual sugar is mainly between 0 and 10, but some values exceed 10 and the maximum is 65.8. We plot the values from 0 to 10 to see the residual sugar distribution in detail. The distribution is skewed to the right.

Density ranges from 0.9871 to 1.0390, with an interquartile range of ~0.0044 (0.9917 to 0.9961). We restrict the axis limits to the interquartile range and adjust the scale, the bin width and the breaks to get a clear view of the density distribution. In detail, some density values are more frequent than others, even though the overall shape is approximately normal.

Some properties have outliers. They can be removed property by property or after a multivariate analysis.

Removing outliers changes the interpretation of some properties, but it is more informative when done as part of a multi-property analysis. It’s important to note that we may not always be interested in the bulk of the data. Sometimes, the outliers are of interest, and it’s important that we understand their values and why they appear in the data set.

Bivariate Plots Section

The analysis of two variables starts with a plot of the correlations among all the variables.

Wine quality is positively correlated with alcohol and negatively correlated with density. Residual sugar is positively correlated with density and negatively correlated with alcohol.
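
A correlation matrix of this kind can be computed with `cor()`. The sketch below uses only the first ten rows of four columns, copied from the `str()` output at the top of the report, so its coefficients will not match those of the full data set; `wine_head` is a hypothetical name.

```r
# First ten rows of four columns, copied from the str() output.
# Ten rows are far too few to reproduce the report's coefficients.
wine_head <- data.frame(
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(wine_head, method = "pearson")
print(round(cor_matrix, 2))

# A single pair can be read off by row and column name.
sugar_alcohol <- cor_matrix["residual.sugar", "alcohol"]
```

Even in this small extract, residual sugar and alcohol move in opposite directions, in line with the negative correlation described above.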

The following plots show the distribution across quality levels of the properties with a high (positive or negative) correlation with quality.

Alcohol values are not homogeneous across quality levels; the same holds for density (outliers removed) and for the other properties that correlate positively or negatively with quality.

Let's look at this in more detail, with the mean and the quantiles for each property, after removing outliers for density, chlorides and volatile acidity.

Alcohol is higher and density is lower for the better-quality wines.
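
The per-quality mean and quartiles can be computed by splitting a property by quality level. This is a minimal sketch on hypothetical values (the `demo` data frame is invented for illustration and is not part of the wine data).

```r
# Hypothetical mini data set: alcohol (%) for wines of three quality levels.
demo <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 9.8, 10.1, 10.5, 10.9, 11.0, 11.4, 11.8)
)

# Mean, median and quartiles of alcohol within each quality level.
# The result is a matrix with one row per quality level.
stats_by_quality <- t(sapply(split(demo$alcohol, demo$quality), function(x) {
  c(mean   = mean(x),
    q25    = unname(quantile(x, 0.25)),
    median = median(x),
    q75    = unname(quantile(x, 0.75)))
}))
print(stats_by_quality)
```

The same call with `demo$density` in place of `demo$alcohol` would give the corresponding table for density.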

Let's focus on quality levels 5, 6 and 7, as they are the most common.

Let's look at the relation of quality with alcohol and density in another visualization.

This visualization makes it clearer: a good wine combines a high alcohol level with low density.

Regarding relations between features, we can observe how density relates to total sulfur dioxide, residual sugar, fixed acidity and chlorides.

The lower the density, the better the quality. Low total sulfur dioxide, low fixed acidity, low chlorides and low residual sugar are linked to a lower density, ergo, a better wine.

The strongest relations are between density and alcohol (negative correlation of 78%) and between density and residual sugar (positive correlation of 84%).

Bivariate Analysis

Wine quality is positively correlated with alcohol and negatively correlated with density. Residual sugar is positively correlated with density and negatively correlated with alcohol.

Lower density means better quality. Low total sulfur dioxide, low fixed acidity, low chlorides and low residual sugar are linked to a lower density, ergo, a better wine. This point is to be investigated in the multivariate analysis.

The strongest relations are between density and alcohol (negative correlation of 78%) and between density and residual sugar (positive correlation of 84%).

Density has a strong positive correlation with residual sugar and moderate positive correlations with total sulfur dioxide, chlorides and fixed acidity.

Alcohol has a strong negative correlation with density and moderate negative correlations with total sulfur dioxide, residual sugar and chlorides.

Residual sugar has a moderate positive correlation with total sulfur dioxide, a negative correlation with alcohol and a strong positive correlation with density.

Chlorides have a moderate positive correlation with density and a negative correlation with alcohol.

Fixed acidity, volatile acidity and citric acid have low correlation with the other variables, except for pH.

Free sulfur dioxide has low correlation with the other variables, except for total sulfur dioxide.

Total sulfur dioxide has a moderate positive correlation with density and a negative correlation with alcohol.

pH has no notable correlation except a negative one with fixed acidity.

Sulphates have no notable correlation with any variable.

It will be interesting to investigate the data with multivariate plots.

Multivariate Plots Section

The multivariate analysis examines the distribution of the wines by quality in relation to:
- alcohol and density,
- density and residual sugar,
- alcohol and chlorides,
- alcohol and total sulfur dioxide.

These 3 charts represent the same information, but none of them is really clear. They are good examples of how important the colors, shapes and sizes of the chart elements are.

There is a clear movement from left to right and from top to bottom as quality increases: better wines have lower density and more alcohol. However, the boundary between wines of quality 5, 6 and 7 is not sharp.

The chart is not visually ideal, but it is clear that low residual sugar and low density are good indications of a good wine.

There is a clear movement from right to left and from top to bottom as quality increases: better wines have lower density and less sugar. However, the boundary between wines of quality 5, 6 and 7 is not sharp.
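
A scatter plot of two properties colored by quality can be sketched in base R as follows. The points are hypothetical, the palette is an arbitrary choice, and the original charts were probably drawn with ggplot2, so this only illustrates the idea of mapping quality to color.

```r
# Hypothetical points: density falls as alcohol rises, and quality tends to
# increase along the same direction. These are NOT the wine data.
demo <- data.frame(
  alcohol = c(9.0, 9.4, 9.8, 10.2, 10.6, 11.0, 11.4, 11.8, 12.2),
  density = c(0.9990, 0.9975, 0.9968, 0.9955, 0.9948,
              0.9936, 0.9927, 0.9915, 0.9906),
  quality = c(5, 5, 6, 5, 6, 7, 6, 7, 7)
)

# One color per quality level, taken from a sequential palette.
quality_f <- factor(demo$quality)
palette_q <- hcl.colors(nlevels(quality_f), palette = "viridis")
point_col <- palette_q[as.integer(quality_f)]

plot(demo$alcohol, demo$density, col = point_col, pch = 19,
     xlab = "alcohol (%)", ylab = "density")
legend("topright", legend = levels(quality_f), col = palette_q,
       pch = 19, title = "quality")
```

Using one distinct, ordered color per level is one way to make the drift across quality levels easier to read than with a continuous gradient over many overlapping points.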

Multivariate Analysis

Lower density means better quality. Low total sulfur dioxide, low fixed acidity, low chlorides and low residual sugar are linked to a lower density, ergo, a better wine. This holds for all white wines. The analysis of the characteristics of the white wines per quality level does not show clear differences between them. Alcohol and residual sugar seem to interact: as we move from a bad wine to a better wine, the points shift toward less sugar and more alcohol.


Final Plots and Summary

Plot One

Description One

This is the basic plot that shows the structure of the data: at a glance, we see that the distribution of wine quality follows an approximately normal curve.

The next step is to find which characteristics place a wine in category 5 or 6, and which combinations of characteristics could determine whether a wine is a 5 or a 9.

Plot Two

Description Two

The two chemical characteristics most correlated with wine quality are alcohol (+44%) and density (-31%). These correlations are not strong.

If we zoom in on the interquartile range, the distributions of these characteristics across the quality categories do not show big differences: the linear regression lines for each category are very close and nearly parallel.
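
Fitting one regression line per quality category and comparing the slopes is a simple way to check whether the lines are parallel. The sketch below uses constructed data in which the slope is identical in every category by design; the `demo` data frame and its coefficients are hypothetical.

```r
# Hypothetical data: within every quality level, density drops by 0.001 per
# percentage point of alcohol; only the intercept differs slightly.
demo <- expand.grid(alcohol = c(9, 10, 11, 12), quality = c(5, 6, 7))
demo$density <- 1.005 - 0.001 * demo$alcohol - 0.0002 * (demo$quality - 5)

# Fit density ~ alcohol separately for each quality level.
fits <- lapply(split(demo, demo$quality), function(d) {
  coef(lm(density ~ alcohol, data = d))
})

# One row per quality level: intercept and slope of its regression line.
coefs <- do.call(rbind, fits)
print(coefs)
```

Nearly equal values in the `alcohol` column, with small differences in the intercept, are what close and nearly parallel lines look like numerically.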

Points of quality 5 and 6 appear throughout the chart, as these are the most frequent wines.

Plot Three

Description Three

These two charts show the relation between quality, chlorides and total sulfur dioxide. The only difference is that the first shows the whole range of the dataset, including outliers, and the second shows the interquartile range (25%-75%). When analysing data, it is important to take into account the impact of outliers on the result. The first chart could lead us to think that there are really big differences in wine quality related to these two characteristics; in the second chart we can see that those differences are not so big.


Reflection

The analysis of the 4898 white wines shows that there are not many bad wines (183). Nearly half of the wines are rated 6 and ~1500 are rated 5; both ratings indicate good wines. Fewer than 1000 wines are rated 7 and only 5 are rated 9.

It is difficult to find a pattern or a characteristic that distinguishes the wines, which suggests that other characteristics make the difference but are not included in the dataset. The color, flavor and smell of a wine depend not only on the quantity of alcohol and sugar but also on the maturation process, the year the grapes were harvested, and other factors linked to the date of the tasting and the expert's circumstances.

The fact that at least 3 experts tasted each wine helps to limit possible bias.

The conclusion after analyzing the data is that if we go to a shop and buy a white wine, we have more than an 80% probability of buying a good wine. Unless you are an unlucky person.

One possible future analysis could take a sample of wines with different quality scores but similar chemical characteristics, and analyze in detail the variations between them that put a wine in one score and not in another.

Another possible analysis could take the 3 original scores of each wine and generate a dataset 3 times the size of this one while keeping the information related to each wine. That is, we could see the 3 scores of each wine and treat each one as a different wine. That could help to explain why a wine can be scored 5 by one expert and 7 by another, for example.

Personal reflection: a good white wine and a great white wine could have the same chemical properties, because what makes a wine great is the spirit inside. Personally I prefer sweet wines with a low level of alcohol, so I probably like a wine rated 6 more than one rated 7 or 8.
